Sentiment analysis

Machine learning course project

Marta Kałużna, Szymon Czop

Sentiment analysis is used to describe the emotions expressed about a given topic: anything from movie reviews and tweets to opinions posted on social media by a particular person. It is a powerful tool for analyzing current and future trends and opinions. In this report we work with a sample of IMDb movie reviews. We begin with some data exploration, calculating several ratios to look for dependencies between variables (individual words, in our case) and checking whether some words are strictly associated with negative or positive emotions. To study word occurrence counts we use Zipf's law. We then investigate which data pre-processing approach works best for logistic regression and naive Bayes. Finally, we compare several common ML models in terms of test-set accuracy and evaluation time.

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
In [2]:
import nltk 
import string
import re
from nltk.corpus import stopwords
ps = nltk.PorterStemmer()
stopword = nltk.corpus.stopwords.words('english')
def clean_text(text):
    text_lc = "".join([word.lower() for word in text if word not in string.punctuation])  # remove punctuation
    text_rc = re.sub(r'[0-9]+', '', text_lc)  # remove digits
    tokens = re.split(r'\W+', text_rc)    # tokenization
    text = [ps.stem(word) for word in tokens if word not in stopword]  # remove stopwords and apply stemming
    return text
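As a sanity check, the same cleaning steps can be reproduced with the standard library alone. This is a minimal sketch: the stop-word list below is a tiny illustrative subset (NLTK's English list is much longer) and the stemming step is omitted.

```python
import re
import string

# tiny illustrative stop-word list; NLTK's English list is much longer
STOPWORDS = {"the", "a", "is", "and", "of", ""}

def clean_text_lite(text):
    # lowercase and strip punctuation, as in clean_text above
    text = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    text = re.sub(r"[0-9]+", "", text)            # drop digits
    tokens = re.split(r"\W+", text)               # tokenize on non-word characters
    return [t for t in tokens if t not in STOPWORDS]  # remove stop words

print(clean_text_lite("The plot is thin, and the acting... 2/10!"))
# ['plot', 'thin', 'acting']
```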
In [6]:
import re
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer
tok = WordPunctTokenizer()

pat1 = r'@[A-Za-z0-9_]+'
pat2 = r'https?://[^ ]+'
combined_pat = r'|'.join((pat1, pat2))
www_pat = r'www\.[^ ]+'  # escape the dot so it matches a literal "."
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

def cleaner(text):
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    try:
        # strip a UTF-8 BOM if the text is still bytes; plain str has no .decode
        bom_removed = souped.decode("utf-8-sig").replace(u"\ufffd", "?")
    except AttributeError:
        bom_removed = souped
    stripped = re.sub(combined_pat, '', bom_removed)
    stripped = re.sub(www_pat, '', stripped)
    lower_case = stripped.lower()
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)
    letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)
    words = [x for x  in tok.tokenize(letters_only) if len(x) > 1]
    return (" ".join(words)).strip()
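The negation handling above is a plain regex substitution; a minimal self-contained sketch of the same idea, with a two-entry dictionary for illustration:

```python
import re

# same idea as negations_dic / neg_pattern above, on a small example
negations = {"don't": "do not", "isn't": "is not"}
pattern = re.compile(r"\b(" + "|".join(map(re.escape, negations)) + r")\b")

text = "this isn't good and i don't like it"
expanded = pattern.sub(lambda m: negations[m.group()], text)
print(expanded)  # this is not good and i do not like it
```

`re.escape` is a minor hardening over the original (the apostrophe happens to be safe either way, but escaping keys is the defensive default).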

IMDb reviews

In [3]:
df = pd.read_csv("IMDB_sample.csv")

Loading and cleaning data

In [7]:
clean_texts = []
for i in range(0,df.shape[0]):                                                                    
    clean_texts.append(cleaner(df['review'][i]))
In [9]:
clean_df["text2"] = df['review'].apply(clean_text)
Checking for NA values
In [26]:
from wordcloud import WordCloud

Creating wordclouds

Wordclouds work well as decoration or a headline in presentations, but they don't carry much information for analysis. Their main idea is to show which words were used most often in a negative or a positive context: the bigger the word in the picture, the higher its frequency in the data.

In [28]:
neg_rev = clean_df[clean_df.target == 0]
neg_string = []
for t in neg_rev.text:
    neg_string.append(t)
neg_string = pd.Series(neg_string).str.cat(sep=' ')

Negative wordcloud

Positive wordcloud

Most frequent words and their ratio between positives and negatives

In [41]:
from sklearn.feature_extraction.text import CountVectorizer
countVectorizer = CountVectorizer() 
countVectorizer2 = CountVectorizer(analyzer = clean_text) 
countVector = countVectorizer.fit_transform(clean_df['text'])
countVector2 = countVectorizer2.fit_transform(clean_df['text'])

Difference in the shapes of vectorizer matrices due to the different text cleaning:

In [60]:
[countVector.shape,countVector2.shape]
Out[60]:
[(7501, 45032), (7501, 30279)]
In [79]:
from IPython.core.display import HTML

def multi_table(table_list):
    ''' Accepts a list of IPython table objects and returns an HTML table
        which contains each one in a cell
    '''
    return HTML(
        '<table><tr style="background-color:white;">' + 
        ''.join(['<td>' + table._repr_html_() + '</td>' for table in table_list]) +
        '</tr></table>'
    )
In [82]:
matrix = countVector.toarray()
neg_matrix = matrix[clean_df.target == 0]
pos_matrix = matrix[clean_df.target == 1]
In [83]:
neg_tf = np.sum(neg_matrix,axis=0)
pos_tf = np.sum(pos_matrix,axis=0)
In [84]:
neg = np.squeeze(np.asarray(neg_tf))
pos = np.squeeze(np.asarray(pos_tf))
term_freq_df = pd.DataFrame([neg,pos],columns=countVectorizer.get_feature_names()).transpose()

As a comparison, we've considered two approaches to CountVectorizer: the first on less-cleaned data, the second with tokenization and removal of stop words and punctuation. We've printed the 10 most commonly used words for both methods. At first glance, every word has almost the same number of negative and positive occurrences. But in the 'non-cleaned' data these words carry no information for sentiment analysis.

The plots confirm what we noticed above: the negative frequency of a word is almost the same as the positive one, especially when the data is not cleaned. Most words fall below 10000 on the first plot and below 2000 on the second. The second plot shows that after cleaning, more points correspond to words that occur clearly more often with one sentiment, so the situation changes for the better when we clean the data.

From the summary, we can note that the R-squared statistic is almost 1, which means our data lies very close to the fitted regression line.

Similar results: R-squared is lower than before, but still high.

Zipf's Law

Zipf's law concerns the frequency of words in written or spoken language. It states that a word's frequency, compared with the most frequent word, is approximately 1/n, where n is the word's rank in the frequency table. So the second most common word occurs about half as often as the first, the third about a third as often, and so on. In the part below we had some fun showing that this law also holds (at least approximately) for our data set.
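The expected frequencies under Zipf's law are easy to compute; a minimal sketch (the starting frequency of 12000 is an arbitrary illustration):

```python
import numpy as np

# Zipf's law: the word of rank n occurs roughly f(1)/n times,
# where f(1) is the frequency of the most common word
def zipf_expected(top_freq, n_ranks):
    ranks = np.arange(1, n_ranks + 1)
    return top_freq / ranks

print(zipf_expected(12000, 4))  # [12000.  6000.  4000.  3000.]
```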

Red dashed lines represent the exact values of Zipf's function; the blue bars are the frequencies of the words that occur most commonly in the data set.

Taking the log scale for the frequencies gives us a nearly straight line. To make it more informative, we added the terms that occur in each segment.

The same for the more strictly cleaned data.

Further calculation of ratios

In this section, we're going to calculate different rates to find any dependencies.

In [139]:
from scipy.stats import hmean
from scipy.stats import norm
def normcdf(x):
    return norm.cdf(x, x.mean(), x.std())

Plot of the harmonic mean of the rate CDF and the frequency CDF (for the less-cleaned data). The closer a point lies to the upper-left corner, the more positive it is; the closer to the bottom-right corner, the more negative.

Plot of the harmonic mean of the rate CDF and the frequency CDF (for the cleaned data). In both cases an interesting, almost symmetrical shape emerges.
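For reference, the score plotted here can be sketched on toy counts. The four positive/negative counts below are made up for illustration; the real computation runs over the columns of term_freq_df, and normcdf is the helper defined above.

```python
import numpy as np
from scipy.stats import hmean, norm

def normcdf(x):
    # CDF of a normal distribution fitted to the sample, as defined above
    return norm.cdf(x, x.mean(), x.std())

# toy counts: positive and negative occurrences of four terms (made up)
pos = np.array([120.0, 15.0, 300.0, 40.0])
neg = np.array([10.0, 200.0, 280.0, 35.0])

pos_rate = pos / (pos + neg)      # share of positive uses of each term
pos_freq_pct = pos / pos.sum()    # share of all positive tokens
# harmonic mean of the two CDF-scaled scores, as plotted above
pos_hmean = hmean([normcdf(pos_rate), normcdf(pos_freq_pct)])
print(pos_hmean.round(3))
```

The harmonic mean keeps the score high only when both components are high, so a term must be both frequent and mostly positive to score well.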

In [147]:
from bokeh.plotting import figure,output_file,show
from bokeh.io import output_notebook, show
from bokeh.models import LinearColorMapper
from bokeh.models import HoverTool
In [151]:
color_mapper = LinearColorMapper(palette='Inferno256', low=min(term_freq_df.pos_normcdf_hmean), high=max(term_freq_df.pos_normcdf_hmean))
p = figure(x_axis_label='neg_normcdf_hmean', y_axis_label='pos_normcdf_hmean')
p.circle('neg_normcdf_hmean','pos_normcdf_hmean',size=5,alpha=0.3,source=term_freq_df2,color={'field': 'pos_normcdf_hmean', 'transform': color_mapper})
hover = HoverTool(tooltips=[('token','@index')])
p.add_tools(hover)
show(p)

Testing models on our dataset

In [13]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from time import time
from sklearn.model_selection import train_test_split

We divided our dataset into a train and a test set (in proportion 80:20). Both sets contain about 50% negative and 50% positive reviews.

In [14]:
x = clean_df.text
y = clean_df.target

SEED = 2020
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=SEED)
print(f"Train set has {len(x_train)} entries where negative are {(len(x_train[y_train == 0])/len(x_train)) * 100}% \n and positive {(len(x_train[y_train == 1])/len(x_train)) * 100}% \n")
print(f"Test set has {len(x_test)} entries where negative are {(len(x_test[y_test == 0])/len(x_test)) * 100}% \n and positive {(len(x_test[y_test == 1])/len(x_test)) * 100}% \n")
Train set has 6000 entries where negative are 50.66666666666667% 
 and positive 49.333333333333336% 

Test set has 1501 entries where negative are 49.43371085942705% 
 and positive 50.56628914057295% 

In [174]:
def accuracy(pipeline,x_train,y_train,x_test,y_test):
    
    t0 = time()
    sentiment_fit = pipeline.fit(x_train, y_train)
    #accuracy_train = pipeline.score(x_train, y_train)
    y_pred = sentiment_fit.predict(x_test)
    train_test_time = time() - t0
    
    accuracy = accuracy_score(y_test, y_pred)

    print(f"Accuracy on test data {accuracy} \n")
    
    print(f"train and test time {train_test_time } ")
    print("-"*85)
    return accuracy, train_test_time
In [167]:
countVectorizer = CountVectorizer() 
countVectorizer2 = CountVectorizer(analyzer = clean_text) 
countVector = countVectorizer.fit_transform(clean_df['text'])
countVector2 = countVectorizer2.fit_transform(clean_df['text'])
lr = LogisticRegression()
In [179]:
n_features = np.arange(5000,30001,2500)
def nfeature_accuracy_checker(vectorizer=countVectorizer, n_features=n_features, stop_words=None, ngram_range=(1, 1), classifier=lr,analyzer = 'word'):
    
    result = []
    print(classifier,'\n')
    for n in n_features:
        vectorizer.set_params(stop_words=stop_words, max_features=n, ngram_range=ngram_range,analyzer = analyzer)
        checker_pipeline = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', classifier)
        ])
        nfeature_accuracy,tt_time = accuracy(checker_pipeline, x_train, y_train, x_test, y_test)
        result.append((n,nfeature_accuracy,tt_time))
    
    return result 

Logistic regression + unigram

We've compared accuracy on the training set for differently prepared data: with stop words, without stop words, and fully cleaned. The plot clearly shows that the data without stop words gives the highest accuracy. What is more, there is almost no difference between using 25000 and 30000 features.

Naive Bayes + unigram

In the case of NB, the accuracy is highest for the fully cleaned data. The results for data with stop words are clearly worse, but the difference between the cleaned data and the data without stop words is negligible. Thus, in the following steps we won't use full cleaning: it isn't worthwhile.
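The Naive Bayes setup used in this comparison follows the same pipeline pattern as nfeature_accuracy_checker; a minimal sketch on a toy corpus (the four example reviews and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# toy labelled reviews (illustrative only, not the IMDb sample)
texts = ["great movie loved it", "terrible boring film",
         "wonderful acting great plot", "awful waste of time"]
labels = [1, 0, 1, 0]

nb_pipeline = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),
    ("classifier", MultinomialNB()),
])
nb_pipeline.fit(texts, labels)
print(nb_pipeline.predict(["great wonderful movie"]))  # [1]
```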

Comparison of 1-6grams on the data without stopwords + logistic regression

It's clearly visible that there are no big differences between the methods.

Comparison of 1-6grams on the data without stopwords + Naive Bayes

This time, the unigram model performs clearly worse, but all the other settings look almost the same.

Comparison of TFIDF and CountVectorizer (Logistic regression)

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfdf = TfidfVectorizer()
In [197]:
feature_tune_tf_ug_pd = pd.DataFrame(feature_tune_tf_ug,columns=['nfeatures','validation_accuracy','train_test_time'])
feature_tune_tf_tg_pd = pd.DataFrame(feature_tune_tf_tg ,columns=['nfeatures','validation_accuracy','train_test_time'])
feature_tune_tf_sg_pd = pd.DataFrame(feature_tune_tf_sg ,columns=['nfeatures','validation_accuracy','train_test_time'])

Dotted lines correspond to TFIDF and solid lines to CountVectorizer. TFIDF clearly works better on our dataset. Once again the unigram model gives worse results, whereas 3-grams and 6-grams work best.
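The difference between the two vectorizers can be sketched on a two-document toy corpus (made up for illustration): raw counts grow with repetition, while TF-IDF down-weights words that appear in every document.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["good movie", "good plot good cast"]  # tiny illustrative corpus

counts = CountVectorizer().fit_transform(docs).toarray()
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# vocabulary (alphabetical): cast, good, movie, plot
print(counts)           # raw occurrence counts
print(tfidf.round(2))   # 'good' occurs in every doc, so TF-IDF down-weights it
```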

Comparison of TFIDF and CountVectorizer (Naive Bayes)

In [203]:
feature_tune_tf_ug_nb_pd = pd.DataFrame(feature_tune_tf_ug_nb,columns=['nfeatures','validation_accuracy','train_test_time'])
feature_tune_tf_tg_nb_pd = pd.DataFrame(feature_tune_tf_tg_nb ,columns=['nfeatures','validation_accuracy','train_test_time'])
feature_tune_tf_sg_nb_pd = pd.DataFrame(feature_tune_tf_sg_nb ,columns=['nfeatures','validation_accuracy','train_test_time'])

For the NB classifier, unigrams look clearly worse in both cases; all the other settings work almost the same. Taking this plot and the previous one into account, we will only consider 3-gram TFIDF in the last step of our sentiment analysis, where we compare different classification methods.

Logistic regression and the words with the biggest impact on classification

Because logistic regression is fully interpretable, we took a look at the words with the highest (positive) and the lowest (negative) coefficients in the model.

In [112]:
logistic_regression = LogisticRegression()
logistic_regression = logistic_regression.fit(X, y_train)
words = vectorizer.get_feature_names()
lr_beta = np.ravel(logistic_regression.coef_)

TOP 100 negative words in sentiment analysis

TOP 100 positive words in sentiment analysis

Most of the terms clustered in groups of a given sentiment make sense. The ones the model treats as negative would be classified the same way by a human (in this case, by us), and the same holds for the positive terms. This means our regression is meaningful and will probably work properly on other texts.

Different methods for sentiment classification

In [217]:
names = ["Logistic Regression", "Linear SVC", "LinearSVC with L1-based feature selection","Multinomial NB", 
         "Bernoulli NB", "Ridge Classifier", "AdaBoost", "Perceptron","Passive-Aggresive", "Nearest Centroid"]

classifiers = [
    LogisticRegression(),
    LinearSVC(),
    Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
  ('classification', LinearSVC(penalty="l2"))]),
    MultinomialNB(),
    BernoulliNB(),
    RidgeClassifier(),
    AdaBoostClassifier(),
    Perceptron(),
    PassiveAggressiveClassifier(),
    NearestCentroid()
    ]
    
zipped_clf = zip(names,classifiers)

tvec = TfidfVectorizer()
In [218]:
def cls_compare(vectorizer=tvec, n_features=30000, stop_words=None, ngram_range=(1, 1), classifier=zipped_clf):
    result = []
    vectorizer.set_params(stop_words=stop_words, max_features=n_features, ngram_range=ngram_range)
    for n,c in classifier:
        checker_pipeline = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', c)
        ])
        print(f"Validation result for {n}")
        print(c)
        clf_accuracy,tt_time = accuracy(checker_pipeline, x_train, y_train, x_test, y_test)
        result.append((n,clf_accuracy,tt_time))
    return result
Some of the less standard classifiers in the list above:
  • Perceptron (a simple neural network: a single layer of weights)
  • Passive-Aggressive (an adaptive, SVC-like online classifier)
  • NearestCentroid (like k-NN, but using class means)
In [219]:
cls_outcome = cls_compare(stop_words = 'english',ngram_range=(1, 3))
cls_imdb_score = pd.DataFrame(cls_outcome,columns = ['model','test accuracy','time'])
Validation result for Logistic Regression
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
/Users/czoppson/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Accuracy on test data 0.8607594936708861 

train and test time 11.535930871963501 
-------------------------------------------------------------------------------------
Validation result for Linear SVC
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
Accuracy on test data 0.8734177215189873 

train and test time 10.448095798492432 
-------------------------------------------------------------------------------------
Validation result for LinearSVC with L1-based feature selection
Pipeline(memory=None,
     steps=[('feature_selection', SelectFromModel(estimator=LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l1', random_state=None, tol=0.0001,
     verbose=0),
        max_features=None, n...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])
Accuracy on test data 0.8520986009327115 

train and test time 10.025551795959473 
-------------------------------------------------------------------------------------
Validation result for Multinomial NB
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Accuracy on test data 0.8474350433044637 

train and test time 9.510265827178955 
-------------------------------------------------------------------------------------
Validation result for Bernoulli NB
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
Accuracy on test data 0.8520986009327115 

train and test time 9.396116018295288 
-------------------------------------------------------------------------------------
Validation result for Ridge Classifier
RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
        max_iter=None, normalize=False, random_state=None, solver='auto',
        tol=0.001)
Accuracy on test data 0.8680879413724184 

train and test time 9.49723219871521 
-------------------------------------------------------------------------------------
Validation result for AdaBoost
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
Accuracy on test data 0.7688207861425717 

train and test time 13.89434003829956 
-------------------------------------------------------------------------------------
Validation result for Perceptron
Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0,
      fit_intercept=True, max_iter=None, n_iter=None, n_iter_no_change=5,
      n_jobs=None, penalty=None, random_state=0, shuffle=True, tol=None,
      validation_fraction=0.1, verbose=0, warm_start=False)
/Users/czoppson/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/stochastic_gradient.py:166: FutureWarning: max_iter and tol parameters have been added in Perceptron in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
  FutureWarning)
Accuracy on test data 0.844103930712858 

train and test time 9.147571802139282 
-------------------------------------------------------------------------------------
Validation result for Passive-Aggresive
PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
              early_stopping=False, fit_intercept=True, loss='hinge',
              max_iter=None, n_iter=None, n_iter_no_change=5, n_jobs=None,
              random_state=None, shuffle=True, tol=None,
              validation_fraction=0.1, verbose=0, warm_start=False)
/Users/czoppson/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/stochastic_gradient.py:166: FutureWarning: max_iter and tol parameters have been added in PassiveAggressiveClassifier in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
  FutureWarning)
Accuracy on test data 0.8620919387075283 

train and test time 8.73842978477478 
-------------------------------------------------------------------------------------
Validation result for Nearest Centroid
NearestCentroid(metric='euclidean', shrink_threshold=None)
Accuracy on test data 0.8014656895403065 

train and test time 8.799280881881714 
-------------------------------------------------------------------------------------

To draw a broader conclusion, we tried a number of models and compared their accuracy and evaluation time. Most of the models reach accuracy close to or above 85%, which is a pleasant surprise, since NLP without neural networks is often considered weak. Here, standard models give decent scores in a very short time. Again, logistic regression seems to be one of the best choices: while the SVC performs slightly better, it is not as easy to find the most negative or positive terms with that classifier, whereas for logistic regression we did exactly that above. It is good to know that the hours spent at the math faculty are not wasted and that we can build a simple model that we fully understand and that fulfils the requirements of the task.

In [ ]:
# a perceptron is just a weight vector w with x@w >= 0 for positive and x@w < 0 for negative examples
In [130]:
import pytreebank
import sys
import os
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score, accuracy_score
from sklearn.metrics import confusion_matrix

STANFORD 5 sentiment (SST-5)

To make things more complicated, we also tried a data set released by Stanford University. The hard part of this analysis is that there are 5 classes, where 1 means very negative, 3 neutral and 5 very positive. We wanted to find out whether a basic model can still achieve a decent outcome on this more demanding data. To assess whether the models match the correct sentiments, we used confusion matrices and evaluated each model according to its outcome.

In [132]:
def accuracy_clf(y_pred,y_true):
    "Prediction accuracy (percentage) and F1 score"
    acc = accuracy_score(y_true, y_pred)*100
    f1 = f1_score(y_true,y_pred, average='macro')
    print("Accuracy: {}\nMacro F1-score: {}".format(acc, f1))   

def plot_confusion_matrix(y_true, y_pred, 
                          classes=[1, 2, 3, 4, 5],
                          normalize=False,
                          cmap=plt.cm.YlOrBr):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    (Adapted from scikit-learn docs).
    """
    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', origin='lower', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # Show all ticks
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # Label with respective list entries
           xticklabels=classes, yticklabels=classes,
           ylabel='True label',
           xlabel='Predicted label')

    # Set alignment of tick labels
    plt.setp(ax.get_xticklabels(), rotation=0, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    return fig, ax

Neither the training nor the test set is perfectly balanced in terms of class occurrences, but this shouldn't noticeably affect the models' accuracy.

In [148]:
df_test = pd.read_csv('./sst_test.txt', sep='\t', header=None, names=['truth', 'text'])
df_test['truth'] = df_test['truth'].str.replace('__label__', '')
df_test['truth'] = df_test['truth'].astype(int).astype('category')
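A quick way to inspect the class balance mentioned above is value_counts; the labels below are a made-up stand-in for df_test['truth']:

```python
import pandas as pd

# toy stand-in for df_test['truth'] (illustrative labels, not the real SST-5 split)
truth = pd.Series([1, 2, 3, 3, 4, 4, 4, 5, 5])
dist = truth.value_counts(normalize=True).sort_index()
print(dist)  # proportion of each class, 1..5
```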

Text Blob

In [136]:
from textblob import TextBlob
In [138]:
accuracy_clf(df.pred_blob,df.truth)
Accuracy: 28.803838951310862
Macro F1-score: 0.23987349627597682
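TextBlob's polarity (and VADER's compound score) is a continuous value in [-1, 1], so it has to be binned into the five SST classes before computing accuracy. A minimal sketch of such a mapping; the bin edges here are our own choice, not part of either library:

```python
import numpy as np

def polarity_to_class(score):
    # map a score in [-1, 1] to SST-5 labels 1..5 (5 equal-width bins; a choice)
    bins = [-0.6, -0.2, 0.2, 0.6]
    return int(np.digitize(score, bins)) + 1

print(polarity_to_class(-0.9), polarity_to_class(0.0), polarity_to_class(0.9))
# 1 3 5
```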

VADER

In [140]:
accuracy_clf(df.pred_vader,df.truth)
Accuracy: 30.957397003745317
Macro F1-score: 0.3042290062686328

LOGISTIC REGRESSION

In [142]:
accuracy_clf(df_test.pred_lr,df_test.truth)
Accuracy: 40.18099547511312
Macro F1-score: 0.3295860165192341

SVM

In [144]:
accuracy_clf(df_test.pred_svm,df_test.truth)
Accuracy: 41.22171945701357
Macro F1-score: 0.37968831901748645

Naive Bayes

In [146]:
accuracy_clf(df_test.pred_nb,df_test.truth)
Accuracy: 39.72850678733032
Macro F1-score: 0.22381085722177008

Conclusion

The accuracy achieved by the original model created for this data set (a Recursive Neural Tensor Network) was about 45.5%. Logistic regression, which is far simpler, gave us about 40%, and NB 39.7%. These models are much easier to interpret, so it is up to us whether we prefer better accuracy or better explainability. All models struggle with classes that are close to each other, i.e. {1,2} or {4,5}: there is a lot of misclassification in those cases. The neutral class (3) is almost always missed and classified as 2 or 4. Even for a human it is hard to tell whether a text is ironic or merely neutral, so for these classifiers it is simply impossible to always make the right decision. It is worth mentioning that some models are very accurate in special cases, e.g. NB does very well at finding very negative and very positive comments.